Chris Pollett > Old
Classes > |
HW#2 --- last modified February 10 2019 22:00:08..Due date: Sep 28
Files to be submitted: Purpose: Related Course Outcomes: The main course outcomes covered by this assignment are: LO1 -- Code a basic inverted index capable of performing conjunctive queries. LO2 -- Be able to calculate by hand on small examples precision (fraction relevant results returned), recall (fraction of results which are relevant), and other IR statistics. LO6 -- Be able to evaluate search results by hand and using TREC eval software. Specification: For the first part of your homework I want you to expand upon the code you wrote for HW1. Now your program should be run from the command-line with a command like: program_name folder_to_index query rank_algorithm For C/C++, program name can be just the name of your program. For Java and PHP, the program name can be "java program_name" or "php program_name". Again, you must program in either C/C++, Java, or PHP and your program should work with a vanilla install of an at most one year old version of gcc/g++/Oracle's Java 1.6/Zend's php. The folder_to_index is, as in the last assignment, a folder containing text files. These will have all have the extension .txt and other files in this folder should be ignored. You should have a readme.txt file saying how to compile and run your program. For this homework, your program does not output the inverted index and related statistics to it. Instead, I want you to store these informations into a class which has the four public methods first, last, prev, and next, described in class for the inverted index ADT. Your implementation of prev and next should use galloping search. The command-line argument query above can be either a single word for example "bob" or it can be a string of words for example "bob smith". If it is more that one term, when I run your program I will put the sequence of terms in double-quotes, so that it will come in to your program as a single command-line argument. Finally, rank_algorithm is either the word "cosine" or the word "proximity". As an example, I might run your program with a line like: php search_program.php test_folder "apple records" cosine On such an input, your program should return a sequence of lines, each line containing the pair (some_filename, rank). Here some_filename should be a file which contains the word apple and contains the word records (not necessarily adjacent to each other). Here rank should be the cosine rank score for that document with respect to this query. You should output results in descending order of rank. In computing cosine rank, use TF_IDF values for the vector component where the TF function is the example one given in the book. If proximity is used rather than cosine, I want you to output the results according to the proximity ranking algorithm given in chapter 2 of the book. To compute only those documents in which all the terms appear I want you to use only the functions of the ADT you have built. To do this rewrite the exact phrase algorithm given in class to instead compute this kind of conjunctive query. Do this in pseudo-code as well as your own code and put your pseudo code in the file conjunctive.txt which you submit as part of your ZIP file. As the last part of your assignment, I want you to create a small text collection of 20 files and put them in a folder test_folder as part of what you submit. Label your files 1.txt through 20.txt Make up a list of five queries (at least two involving more than one term). For each of your queries and each of your files make a 1 (yes), 0 (no) judgment about the relevance of that file to that query. Each of your queries should have the key words searched on as well as a description of the words you expect to get back. Put your queries, and descriptions in queries.txt. For instance, you might have a query "apple records", with a description of the query being the record company of the Beatles. So a document about Apple computers which mentions the word records should not be considered a relevant result for this query. Have a text file judgments.txt, with lines like (query_no, doc_no, relevance) which records this information. Next I want you to compute by hand the recall, precision, and MAP scores of your search program above with respect to each query using this information and what it returns. Put your calculations into the file scores.txt All components of the grade breakdown below will be graded as either 0- did not work/do, 1/2 - partially works/partially achieved, 1 - full credit Point Breakdown
|